生信软件：diamond

您所在的位置：网站首页 › diamond classic › 生信软件：diamond

生信软件：diamond

2024-07-14 18:50| 来源: 网络整理| 查看: 265

介绍：

diamond软件是NCBI推出的blast+升级版本（应该是的）。

功能：

寻找同源序列

使用：

1.建库

diamond makedb --in nr.faa -d nr

--in 填写建库所需的序列文件，fasta格式(.faa一般指蛋白序列文件) -d 索引的前缀名，生成后缀为.dmnd的文件

2.序列比对

diamond blastp -d nr -q gene.fasta -o matches.m8 -f 6 -p 50 --more-sensitive -e 1e-5

--db/-d 输入比对数据库 --query/-q 比对序列 --threads/-p 线程数 --out/-o 输出文件 --outfmt/-f 输出文件格式，默认6（表格） --evalue/-e 比对的最大evalue值（默认0.001) --max-target-seqs/-k 比对到的最大序列数，默认值是25 --more-sensitive 比对准确度更高

其他参数： --top 百分数的形式表示--max-target-seqs --min-score 最小评分 --id 给出指定百分比的数据 --subject-cover 最小覆盖度 --unal (0,1) 是否输出未比对上的reads（0=no, 1=yes） --sensitive 建议对齐较长的序列 --block-size/b，一次处理的十亿碱基的大小，主要控制内存使用，默认为2（预计使用此内存数量的大约六倍，即默认内存使用将到达12G），转录流程使用0.2 --salltitles 将全长标题包含在DAA格式中，默认DAA文件仅包含缩短序列ID（直到第一个空白字符）

3.比对结果

image.png 别人简书里的图片，默认输出格式

每一列是什么呢？ BLASTn output format 6

Column headers: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

1. qseqid query sequence id 2. sseqid subject (e.g., reference genome) sequence id 3. pident percentage of identical matches 4. length alignment length 5. mismatch number of mismatches 6. gapopen number of gap openings 7. qstart start of alignment in query 8. qend end of alignment in query 9. sstart start of alignment in subject 10. send end of alignment in subject 11. evalue expect value 12. bitscore bit score

自定义输出结果

--outfmt 6 qseqid sseqid pident qlen length mismatch gapope evalue bitscore

diamond输出格式：

0 BLAST pairwise format. 5 BLAST XML format. 6 表格模式 (默认输出格式). 100 DIAMOND 101 SAM format. 102 Taxonomic classification. 103 PAF format.

4.结果筛选（linux）

#选取identity>80的结果 awk -F"\t" '$3>80' matches.m8 > matches.m8.identity80

附件： diamond 帮助文档

Licensed under the GNU AGPL Check http://github.com/bbuchfink/diamond for updates. Syntax: diamond COMMAND [OPTIONS] Commands: makedb Build DIAMOND database from a FASTA file blastp Align amino acid query sequences against a protein reference database blastx Align DNA query sequences against a protein reference database view View DIAMOND alignment archive (DAA) formatted file help Produce help message version Display version information getseq Retrieve sequences from a DIAMOND database file dbinfo Print information about a DIAMOND database file General options: --threads (-p) number of CPU threads --db (-d) database file --out (-o) output file --outfmt (-f) output format 0 = BLAST pairwise 5 = BLAST XML 6 = BLAST tabular 100 = DIAMOND alignment archive (DAA) 101 = SAM Value 6 may be followed by a space-separated list of these keywords: qseqid means Query Seq - id # Query序列id qlen means Query sequence length # Query序列长度 sseqid means Subject Seq - id # Subject序列id sallseqid means All subject Seq - id(s), separated by a ';' # 所有Subject序列id，以“；”分隔 slen means Subject sequence length # Subject序列长度 qstart means Start of alignment in query # Query对齐序列起点 qend means End of alignment in query # Query对齐序列终点 sstart means Start of alignment in subject # Subject对齐序列起点 send means End of alignment in subject # Subject对齐序列终点 qseq means Aligned part of query sequence # Query序列对齐部分 sseq means Aligned part of subject sequence # Subject序列对齐部分 full_sseq means Full subject sequence # 完整的Subject序列 evalue means Expect value # 期望值 bitscore means Bit score # 比特得分 score means Raw score # 原始分数 length means Alignment length # 比对长度 pident means Percentage of identical matches # 相同匹配的百分比 nident means Number of identical matches # 相同匹配的数目 mismatch means Number of mismatches # 不匹配的数目 positive means Number of positive - scoring matches # 相似匹配的数目 gapopen means Number of gap openings # 间隙开口的数目 gaps means Total number of gaps # 间隙的总数 ppos means Percentage of positive - scoring matches # 相似匹配的百分比 qframe means Query frame # Query的翻译类型 btop means Blast traceback operations(BTOP) # blast回溯操作 staxids means unique Subject Taxonomy ID(s), separated by a ';' (in numerical order) # Subject的物种id，以“；”分隔 stitle means Subject Title # Subject的题目(>右侧所有内容) salltitles means All Subject Title(s), separated by a '' # 所有Subject题目(>右侧所有内容)，以“”分隔 qcovhsp means Query Coverage Per HSP # Query每个HSP（匹配的一段序列）的覆盖度 qtitle means Query title # Query的题目(>右侧所有内容) Default: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore # 默认输出的内容 --verbose (-v) verbose console output --log enable debug log --quiet disable console output Makedb options: --in input reference file in FASTA format Aligner options: --query (-q) input query file --strand query strands to search (both/minus/plus) --un file for unaligned queries --unal report unaligned queries (0=no, 1=yes) --max-target-seqs (-k) maximum number of target sequences to report alignments for --top report alignments within this percentage range of top alignment score (overrides --max-target-seqs) --range-culling restrict hit culling to overlapping query ranges --compress compression for output files (0=none, 1=gzip) --evalue (-e) maximum e-value to report alignments (default=0.001) --min-score minimum bit score to report alignments (overrides e-value setting) --id minimum identity% to report an alignment --query-cover minimum query cover% to report an alignment --subject-cover minimum subject cover% to report an alignment --sensitive enable sensitive mode (default: fast) --more-sensitive enable more sensitive mode (default: fast) --block-size (-b) sequence block size in billions of letters (default=2.0) --index-chunks (-c) number of chunks for index processing --tmpdir (-t) directory for temporary files --gapopen gap open penalty --gapextend gap extension penalty --frameshift (-F) frame shift penalty (default=disabled) --matrix score matrix for protein alignment (default=BLOSUM62) --custom-matrix file containing custom scoring matrix --lambda lambda parameter for custom matrix --K K parameter for custom matrix --comp-based-stats enable composition based statistics (0/1=default) --masking enable masking of low complexity regions (0/1=default) --query-gencode genetic code to use to translate query (see user manual) --salltitles include full subject titles in DAA file --sallseqid include all subject ids in DAA file --no-self-hits suppress reporting of identical self hits --taxonmap protein accession to taxid mapping file --taxonnodes taxonomy nodes.dmp from NCBI --taxonlist restrict search to list of taxon ids (comma-separated) Advanced options: --algo Seed search algorithm (0=double-indexed/1=query-indexed) --bin number of query bins for seed search --min-orf (-l) ignore translated sequences without an open reading frame of at least this length --freq-sd number of standard deviations for ignoring frequent seeds --id2 minimum number of identities for stage 1 hit --window (-w) window size for local hit search --xdrop (-x) xdrop for ungapped alignment --ungapped-score minimum alignment score to continue local extension --hit-band band for hit verification --hit-score minimum score to keep a tentative alignment --gapped-xdrop (-X) xdrop for gapped alignment in bits --band band for dynamic programming computation --shapes (-s) number of seed shapes (0 = all available) --shape-mask seed shapes --index-mode index mode (0=4x12, 1=16x9) --rank-ratio include subjects within this ratio of last hit (stage 1) --rank-ratio2 include subjects within this ratio of last hit (stage 2) --max-hsps maximum number of HSPs per subject sequence to save for each query --range-cover percentage of query range to be covered for hit culling (default=50) --dbsize effective database size (in letters) --no-auto-append disable auto appending of DAA and DMND file extensions --xml-blord-format Use gnl|BL_ORD_ID| style format in XML output View options: --daa (-a) DIAMOND alignment archive (DAA) file --forwardonly only show alignments of forward strand Getseq options: --seq Sequence numbers to display.

参考文档： https://www.jianshu.com/p/199824ca5234 https://www.jianshu.com/p/2c4c53b74594

【本文地址】

生信软件：diamond

生信软件：diamond

今日新闻

推荐新闻